9 Empirical Distribution

1 Convergence of Empirical Distribution

Suppose $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} F$, where $F(x)=P(X\le x)$ is an unknown c.d.f. We want to estimate $F:\mathbb{R}\to[0,1]$.
A natural estimator is the empirical distribution $\hat F_n:\mathbb{R}\times\Omega\to[0,1]$:
$$\hat F_n(x)=\frac{1}{n}\sum_{i=1}^n I(X_i\le x),$$
where for $\omega\in\Omega$,
$$I(X_i\le x)(\omega)=\begin{cases}1, & X_i(\omega)\le x,\\ 0, & \text{otherwise.}\end{cases}$$

  1. $X_i(\omega)\le x \iff \omega\in X_i^{-1}((-\infty,x])$.
  2. Note that $E[I(X_i\le x)]=P(I(X_i\le x)=1)=P(\{\omega\in\Omega \mid X_i(\omega)\le x\})=P(X_i\le x)=F(x)$.
    So by the SLLN, $\forall x\in\mathbb{R}$, $\hat F_n(x)\xrightarrow{\text{a.s.}}F(x)$. I.e. $\forall x\in\mathbb{R}$, $P(\lim_{n\to\infty}\hat F_n(x)=F(x))=1$.
    If we expand the limit claim: $\forall\varepsilon>0$, $\exists N(x,\omega,\varepsilon)$ s.t. $\forall n\ge N(x,\omega,\varepsilon)$, $|\hat F_n(x,\omega)-F(x)|<\varepsilon$. Here $N$ depends on $x$, so this is pointwise convergence.
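As a quick illustration of the estimator and its pointwise convergence, here is a small simulation sketch. Taking $F$ to be the standard normal c.d.f. is purely an assumed example, and the function name `ecdf` is introduced here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def ecdf(sample, x):
    """Empirical c.d.f.: F_hat_n(x) = (1/n) * #{i : X_i <= x}."""
    return np.mean(sample <= x)

# At a fixed x, F_hat_n(x) should approach F(x) as n grows (SLLN).
# Here F = standard normal, so F(0.5) ~= 0.6915.
x = 0.5
for n in [100, 10_000, 1_000_000]:
    sample = rng.standard_normal(n)
    print(n, ecdf(sample, x))
```

The printed values should stabilize near $\Phi(0.5)\approx 0.6915$ as $n$ grows, matching the SLLN claim above.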

One can obtain a stronger result:

Theorem (Glivenko-Cantelli)

Suppose $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}F$. Then $\sup_x|\hat F_n(x)-F(x)|\xrightarrow{\text{a.s.}}0$. In other words, $P\left(\lim_{n\to\infty}\sup_x|\hat F_n(x)-F(x)|=0\right)=1$.

If we also expand it: $\forall\varepsilon>0$, $\exists N(\omega,\varepsilon)$ s.t. $\forall n\ge N(\omega,\varepsilon)$, $|\hat F_n(x,\omega)-F(x)|<\varepsilon$ for all $x\in\mathbb{R}$. Here $N$ does not depend on $x$, so this is uniform convergence.

The following proof and discussion are inserted from later notes; readers can skip this part for now.

Define $D_n=\sup_x|\hat F_n(x)-F(x)|$.
The theorem is then equivalent to $D_n\xrightarrow{\text{a.s.}}0$. However, today we only prove a weaker version: $D_n\xrightarrow{p}0$.
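A sketch of how $D_n$ can be computed in practice: $\hat F_n$ only jumps at the order statistics $X_{(1)}\le\dots\le X_{(n)}$, so for continuous $F$ the supremum is attained by comparing $F(X_{(i)})$ against $i/n$ and $(i-1)/n$. The uniform samples (so $F(x)=x$) are an assumed illustration, and `sup_distance` is a name introduced here:

```python
import numpy as np

def sup_distance(sample, F):
    """D_n = sup_x |F_hat_n(x) - F(x)| for a continuous c.d.f. F.
    The sup is attained at the order statistics, so it suffices to
    compare F(X_(i)) against i/n and (i-1)/n."""
    xs = np.sort(sample)
    n = len(xs)
    u = F(xs)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

# Assumed example: F uniform on (0,1), so F(x) = x.
# D_n should shrink (at rate ~ 1/sqrt(n), as seen later).
rng = np.random.default_rng(0)
for n in [100, 10_000]:
    print(n, sup_distance(rng.uniform(size=n), lambda x: x))
```

For a single observation at $0.5$ with $F(x)=x$, the formula gives $D_1=\max(1-0.5,\,0.5-0)=0.5$, which is a handy sanity check.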

Lemma 1

Suppose $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}F$ with $F$ continuous. Then the distribution of $D_n$ is the same for all continuous $F$. (Intuition: by the probability integral transform, $U_i=F(X_i)\sim\mathrm{Unif}(0,1)$, and $D_n$ can be rewritten in terms of the $U_i$ alone.)
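A simulation sketch of Lemma 1: replications of $D_n$ drawn under two different continuous distributions should look like samples from one common distribution. The standard normal and $\mathrm{Exp}(1)$ choices below, and the names `D_n`, `Phi`, `Exp1`, are assumptions of this sketch:

```python
import numpy as np
from math import erf, sqrt, exp

rng = np.random.default_rng(0)

def D_n(sample, F):
    """D_n = sup_x |F_hat_n(x) - F(x)|, evaluated at the order statistics."""
    xs = np.sort(sample)
    n = len(xs)
    u = np.array([F(x) for x in xs])
    i = np.arange(1, n + 1)
    return max((i / n - u).max(), (u - (i - 1) / n).max())

def Phi(x):   # standard normal c.d.f.
    return 0.5 * (1 + erf(x / sqrt(2)))

def Exp1(x):  # Exp(1) c.d.f.
    return 1 - exp(-x)

# Many replications of D_n under each F; by Lemma 1 the two collections
# of D_n values share one distribution, so e.g. their means agree.
n, reps = 50, 2000
d_norm = [D_n(rng.standard_normal(n), Phi) for _ in range(reps)]
d_exp = [D_n(rng.exponential(size=n), Exp1) for _ in range(reps)]
print(np.mean(d_norm), np.mean(d_exp))
```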

2 Relation to Brownian Bridge Kernel

Recall the multivariate CLT (here applied to $U_1,\dots,U_n\overset{\text{i.i.d.}}{\sim}\mathrm{Unif}(0,1)$, as Lemma 1 allows):
$$\sqrt{n}\left(\begin{bmatrix}\hat F_n(u_1)\\\vdots\\\hat F_n(u_k)\end{bmatrix}-\begin{bmatrix}u_1\\\vdots\\u_k\end{bmatrix}\right)=\frac{1}{\sqrt n}\sum_{a=1}^n\left(\begin{bmatrix}1\{U_a\le u_1\}\\\vdots\\1\{U_a\le u_k\}\end{bmatrix}-\begin{bmatrix}u_1\\\vdots\\u_k\end{bmatrix}\right)\xrightarrow{d}N_k(0,\Sigma),$$
where $\Sigma=(\Sigma_{ij})_{i,j=1,\dots,k}$ with
$$\Sigma_{ij}=\mathrm{Cov}(1\{U\le u_i\},1\{U\le u_j\})=E[1\{U\le u_i\}1\{U\le u_j\}]-E[1\{U\le u_i\}]E[1\{U\le u_j\}]=\min\{u_i,u_j\}-u_iu_j,$$
since $E[1\{U\le u_i\}1\{U\le u_j\}]=P(U\le u_i,U\le u_j)=\min\{u_i,u_j\}$ and $E[1\{U\le u_i\}]=P(U\le u_i)=u_i$.
This is true for all $k$. So the limit corresponds to a Gaussian process with the Brownian bridge kernel, the Brownian bridge $\{B^{\mathrm{br}}(u),\,u\in(0,1)\}$.
Hence $\sqrt{n}\,D_n=\sup_{u\in(0,1)}\sqrt{n}\,|\hat F_n(u)-u|\xrightarrow{d}\sup_{u\in(0,1)}|B^{\mathrm{br}}(u)|$.
We have another fact about the Brownian bridge:

Theorem (Kolmogorov-Smirnov)

$$P\left(\sup_{u\in(0,1)}|B^{\mathrm{br}}(u)|>x\right)=2\sum_{k=1}^{\infty}(-1)^{k+1}e^{-2k^2x^2}.$$

The first term $2e^{-2x^2}$ alone is very accurate.
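The series can be checked numerically; the sketch below (the name `ks_tail` is mine) compares the truncated series against the one-term approximation $2e^{-2x^2}$:

```python
from math import exp

def ks_tail(x, terms=100):
    """P(sup_u |B_br(u)| > x) = 2 * sum_{k>=1} (-1)^{k+1} exp(-2 k^2 x^2),
    truncated after `terms` terms (it converges extremely fast)."""
    return 2 * sum((-1) ** (k + 1) * exp(-2 * k * k * x * x)
                   for k in range(1, terms + 1))

# Full series vs. the one-term approximation 2*exp(-2 x^2):
for x in [1.0, 1.36, 2.0]:
    print(x, ks_tail(x), 2 * exp(-2 * x * x))
```

At $x=1$ the one-term approximation is already within about $0.25\%$ of the full series, and the agreement only improves for larger $x$, which is why the remark above is justified.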

So if $n$ is large, $P\left(D_n>\frac{x}{\sqrt n}\right)\approx 2e^{-2x^2}$.
This can be used to find an asymptotic level-$\alpha$ confidence band for estimating $F(u)$ simultaneously for all $u$: set $2e^{-2x^2}=\alpha$, so $x=\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}$ and the band half-width is $\frac{x}{\sqrt n}=\sqrt{\frac{1}{2n}\ln\frac{2}{\alpha}}$.
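The half-width formula can be computed directly; the values $\alpha=0.05$, $n=1000$ and the name `ks_band_halfwidth` below are assumed for illustration:

```python
from math import log, sqrt

def ks_band_halfwidth(n, alpha):
    """Half-width x/sqrt(n) of the asymptotic level-alpha simultaneous
    band F_hat_n(u) +/- x/sqrt(n), solving 2*exp(-2 x^2) = alpha."""
    return sqrt(log(2 / alpha) / (2 * n))

# With alpha = 0.05 this recovers x ~= 1.358, the familiar
# Kolmogorov-Smirnov critical value; for n = 1000 the band
# half-width is about 0.043.
print(ks_band_halfwidth(1000, 0.05))
```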